{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd \n",
    "pd.set_option('display.max_columns', 500)\n",
    "from tqdm import tqdm\n",
    "import numpy as np\n",
    "from matplotlib import pyplot as plt \n",
    "#import lightgbm as lgb\n",
    "from scipy.stats import mode\n",
    "from sklearn.preprocessing import LabelEncoder\n",
    "from datetime import timedelta\n",
    "from pandas import pivot_table\n",
    "\n",
    "import seaborn as sns\n",
    "sns.set()\n",
    "%config InlineBackend.figure_format = 'svg'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Author Sergei Bulaev , Slack name: @ser-serege ,  Fall 2018"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 1. Dataset and features description"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###### This dataset contains the history of customer transactions for 3 months of preferential use of the banking product.\n",
    "\n",
    "In the test file.the csv contains lines of c 518375 transactions made by the clients of the Bank. The cl_id column contains the internal client id. For each unique cl_id, you should predict whether the client will continue to use the product (target_flag). A value of 0 indicates failure and a value of 1 indicates continued use."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "| Column        | Transcription                         |\n",
    "|---------------|---------------------------------------|\n",
    "|PERIOD         |transaction month                      |       \n",
    "|cl_id          |client id                              |\n",
    "|MCC            |seller category code                   |\n",
    "|channel_type   |customer engagement channel            |\n",
    "|currency       |currency                               |\n",
    "|TRDATETIME     |transaction date/time                  |\n",
    "|amount         |transaction amount                     |\n",
    "|trx_category   |type of transaction POS payment through\n",
    "|               |the POS terminal, C2C_OUT – transfer \n",
    "|               |(outgoing payment), C2C_IN – card      |\n",
    "|               |transaction (incoming payment), DEPOSIT| \n",
    "|               |card in the ATM, WD_ATM_PARTNER – cash |\n",
    "|               | withdrawals at ATMs partners\n",
    "|target_flag    |will the customer continue to use the product after the grace period (1/0) (target)\n",
    "|target_sum     | the amount of the transaction kind of POS for the three future months (target)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#raw_df = pd.read_csv('Rosbankk.csv',error_bad_lines=False)\n",
    "#\n",
    "#raw_df.to_csv('rosbank_train.csv')\n",
    "#test = pd.read_csv('rosbank_test.csv',error_bad_lines=False)\n",
    "raw_df = pd.read_csv('Rosbankk.csv',error_bad_lines=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#raw_df = pd.read_csv('rosbank_train.csv',error_bad_lines=False)\n",
    "#del raw_df['Unnamed: 0']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_df['cl_id'].nunique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.hist(raw_df[raw_df['target_flag'] == 1]['target_flag'].dropna(), color='red', alpha=0.3, bins=30);\n",
    "plt.hist(raw_df[raw_df['target_flag'] == 0]['target_flag'].dropna(), color='green', alpha=0.5, bins=30);\n",
    "print(round((raw_df[raw_df['target_flag'] == 1]['cl_id'].nunique() / raw_df['cl_id'].nunique())*100,1), '% of taget = 1')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_df.info()\n",
    "raw_df.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print( 'Number of unique clients =',raw_df['cl_id'].nunique())\n",
    "print ('At channel_type column there are ', round(100*(len(raw_df[raw_df['channel_type'].isna()]) / len(raw_df)),1), '% of empty cells')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Let's plot where in dimention distribution of target"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = raw_df[['cl_id','target_flag']].groupby('cl_id').agg('max').reset_index()\n",
    "ind = X['target_flag']==0\n",
    "plt.plot(X['cl_id'][ind], np.random.rand(np.sum(ind)), 'g.', label='negative case')\n",
    "ind = X['target_flag']==1\n",
    "plt.plot(X['cl_id'][ind], np.random.rand(np.sum(ind)), 'b.', label='positive case')\n",
    "plt.legend()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# From dataset we see, that:\n",
    "MCC is not int number, it have to be categorial data(we will find descriptions in the Internet);  Currency should be also categorial data\n",
    "\n",
    "In total there are 490513 transactions for 3 month, by made 5000 clients. In general it's about \n",
    "98 transactions by 1 client for 3 month of using card.\n",
    "\n",
    "Time period is from 01/01/2017' to '01/12/2017'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From description we see that there are empty cells in channel_type feature. Let's fill them into 'type6'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_df.channel_type.unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#From description we see that there are empty cells in channel_type feature. Let's fill them into 'type6'\n",
    "\n",
    "raw_df.channel_type.fillna('type6', inplace = True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From description we see that PERIOD and TRDATETIME have Object type. and strange format. let's parse it and convert to datetime format"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from datetime import datetime, date, time\n",
    "\n",
    "raw_df['PERIOD'] = raw_df['PERIOD'].apply(pd.to_datetime)\n",
    "\n",
    "# Creating separate cols for yr, month,...\n",
    "raw_df['Year'] = raw_df.TRDATETIME.str[5:7]\n",
    "raw_df['Month'] = raw_df.TRDATETIME.str[2:5]\n",
    "raw_df['Date'] = raw_df.TRDATETIME.str[0:2]\n",
    "raw_df['Hour'] = raw_df.TRDATETIME.str[8:10]\n",
    "\n",
    "# Replace month with ints\n",
    "raw_df.Month = raw_df.Month.replace(to_replace=['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN','JUL','AUG',\\\n",
    "                                                'SEP','OCT','NOV','DEC' ], value=[1,2,3,4,5,6,7,8,9,10,11,12])\n",
    "\n",
    "raw_df.Year = raw_df.Year.apply(pd.to_numeric)\n",
    "raw_df.Date = raw_df.Date.apply(pd.to_numeric)\n",
    "raw_df.Month= raw_df.Month.apply(pd.to_numeric)\n",
    "raw_df.Hour = raw_df.Hour.apply(pd.to_numeric)\n",
    "raw_df.Year = raw_df.Year + 2000\n",
    "\n",
    "# making date format\n",
    "def to_date(row):    \n",
    "    return date(row[10], row[11], row[12])\n",
    "raw_df['DateFormat'] = raw_df.apply(to_date, axis=1)\n",
    "\n",
    "# making Quater of the Year feature \n",
    "def Quater(row):\n",
    "    if row['Month']in [1, 2, 3]:\n",
    "        return 1\n",
    "    if row['Month']in [4, 5, 6]:\n",
    "        return 2   \n",
    "    if row['Month']in [7, 8, 9]:\n",
    "        return 3 \n",
    "    if row['Month']in [10, 11, 12]:\n",
    "        return 4\n",
    "\n",
    "# Applying features is the day is weekend and quater\n",
    "raw_df['quater_of_year'] = raw_df.apply(Quater, axis = 1)\n",
    "raw_df['weekend'] = raw_df['DateFormat'].astype('datetime64[ns]')\n",
    "raw_df['weekend'] = ((raw_df.weekend.dt.dayofweek) // 5 ==1).astype(float)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_df.currency.unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_df.trx_category.unique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### In python there is a library which can convert currancy. Let's convert all amounts to rubles"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from currency_converter import CurrencyConverter\n",
    "\n",
    "converter = CurrencyConverter(fallback_on_missing_rate=True, fallback_on_wrong_date=True)\n",
    "converter_currencies = converter.currencies\n",
    "\n",
    "def convert_to_rub(amount, currency, day):\n",
    "    if currency == 'RUB':\n",
    "        return amount\n",
    "    else:\n",
    "        if currency in converter_currencies:\n",
    "            return converter.convert(amount, currency, 'RUB', date = day)\n",
    "        else: amount     \n",
    "        return amount\n",
    "    \n",
    "# also from task descripttion we know that there are cash in and cash out . \n",
    "# It that logic make the functions which convert in (+) or (-)\n",
    "    \n",
    "def cash_in_out(raw):\n",
    "    if raw['trx_category'] == 'POS':\n",
    "        return raw['amount']*(-1)\n",
    "    if raw['trx_category'] == 'C2C_OUT':\n",
    "        return raw['amount']*(-1)\n",
    "    if raw['trx_category'] == 'WD_ATM_PARTNER':\n",
    "        return raw['amount']*(-1)\n",
    "    if raw['trx_category'] == 'WD_ATM_ROS':\n",
    "        return raw['amount']*(-1)    \n",
    "    else:\n",
    "        return raw['amount']\n",
    "    \n",
    "raw_df['amount'] = raw_df.apply(lambda x: convert_to_rub(x['amount'], x['currency'], x['DateFormat']), axis = 1)\n",
    "raw_df['cash_in_out'] = raw_df.apply(cash_in_out, axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Let's create a feature that discribe currnecy  Rub, Dollar ,Euro other"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def Is_rub(raw):\n",
    "    if raw['currency'] == 810:\n",
    "        return 'Rub'\n",
    "    if raw['currency'] == 643:\n",
    "        return 'Rub'\n",
    "    if raw['currency'] ==840:\n",
    "        return '$'\n",
    "    if raw['currency'] == 978:\n",
    "        return 'Euro'\n",
    "    else: \n",
    "        return 'other'\n",
    "raw_df['cur']= raw_df.apply(Is_rub, axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###### Let create a feature of start and end of perion of using the card"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "max_date = raw_df[['cl_id', 'DateFormat']].groupby('cl_id').max().reset_index()\n",
    "max_date.columns = ['cl_id', 'last_action']\n",
    "\n",
    "raw_df = pd.merge(raw_df, max_date, how='left', on='cl_id')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### I have a hypothesis that if client continue to usethe card actively in the end of period of  preferential use period he will continue to use card after . So let's create features for last 14 days and last 30 days before the end of period"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_df['last_action'] = pd.to_datetime(raw_df['last_action'])\n",
    "\n",
    "raw_df['last_14_days'] = raw_df['last_action'] - timedelta(days=14)\n",
    "raw_df['last_30_days'] = raw_df['last_action'] - timedelta(days=30)\n",
    "\n",
    "raw_df['DateFormat'] = pd.to_datetime(raw_df['DateFormat'])\n",
    "raw_df['last_14_days']= pd.to_datetime(raw_df['last_14_days'])\n",
    "raw_df['last_30_days'] = pd.to_datetime(raw_df['last_30_days'])\n",
    "\n",
    "def last_14_days1(row):\n",
    "    if row['DateFormat']>=row['last_14_days']:\n",
    "        return 1\n",
    "\n",
    "def last_30_days1(row):\n",
    "    if row['DateFormat']>=row['last_30_days']:\n",
    "        return 1\n",
    "    \n",
    "raw_df['last_14_days'] = raw_df.apply(last_14_days1, axis=1)\n",
    "raw_df['last_30_days'] = raw_df.apply(last_30_days1, axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### The next one quation to solve is MCC (Merchnt Category Code) codes. This are codes identifies the kind of operation of client. For "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mcc_codes = pd.read_excel('mcc_codes1.xlsx')\n",
    "mcc_codes.columns = ['MCC', 'Name' , 'Group']\n",
    "\n",
    "raw_df = pd.merge(raw_df, mcc_codes, 'left', on=['MCC'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mcc_codes.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### MCC codes have current name and grouped name. We will use it for groupby functions\n",
    "##### One of MCC codes means cashback from POS operations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def cashback(raw):\n",
    "    if raw['trx_category'] == 'POS':\n",
    "        return raw['amount']*0.02\n",
    "raw_df['cashback'] = raw_df.apply(cashback, axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Finaly we have"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### All NaN's  mean that  they are equal to 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_df = raw_df.fillna(0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save it fo file \n",
    "#raw_df.to_csv('rosbank_train1.csv')\n",
    "#raw_df = pd.read_csv('rosbank_train1.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Okay it seems that we prepare dataset for further groupby functtions. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_df.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_df.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Doing some aggregations to generate sample of unique clients. General approach is to count for categorial features and aggregate by for numerical ['max', 'min', 'mean', 'count' , 'sum']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def days_in_use(x):\n",
    "    return (np.max(x) - np.min(x)).days\n",
    "\n",
    "days_usage = raw_df[['cl_id','DateFormat']].groupby('cl_id').agg(days_in_use)\n",
    "\n",
    "days_usage['target_flag'] = raw_df['target_flag']\n",
    "\n",
    "max_date = raw_df[['cl_id', 'DateFormat']].groupby('cl_id').max()\n",
    "days_usage['days_from_end_period']= (max(raw_df['DateFormat']) - max_date['DateFormat']).dt.days\n",
    "\n",
    "\n",
    "num_trans_total = raw_df[['cl_id','DateFormat']].groupby('cl_id').agg('count').reset_index()\n",
    "num_trans_total.columns = ['cl_id', 'num_trans_total']\n",
    "num_trans_total.index=num_trans_total.cl_id\n",
    "days_usage['Num_trans_total'] = num_trans_total.num_trans_total\n",
    "\n",
    "num_trans_month = raw_df[['cl_id','Month']].groupby('cl_id').agg(['max', 'min', 'mean', 'count', 'sum']).reset_index()\n",
    "#num_trans_month.columns = ['cl_id', 'num_trans_month']\n",
    "num_trans_month.index=num_trans_month.cl_id\n",
    "days_usage[num_trans_month.columns] = num_trans_month\n",
    "\n",
    "\n",
    "balance_on_end_of_period = raw_df[['cl_id', 'cash_in_out']].groupby('cl_id').agg(['max', 'min', 'mean', 'count','sum']).reset_index()\n",
    "#balance_on_end_of_period.columns=['cl_id', 'balance_on_end_of_period']\n",
    "balance_on_end_of_period.index=balance_on_end_of_period.cl_id\n",
    "days_usage[balance_on_end_of_period.columns] = balance_on_end_of_period\n",
    "\n",
    "cashback = raw_df[['cl_id', 'cashback']].groupby('cl_id').sum().reset_index()\n",
    "cashback.columns=['cl_id', 'cashback']\n",
    "cashback.index=cashback.cl_id\n",
    "days_usage['cashback'] = cashback.cashback\n",
    "\n",
    "\n",
    "spent_trx_category = raw_df[['cl_id', 'trx_category' ,'amount']].groupby(['cl_id', 'trx_category']).sum().\\\n",
    "                                                                                        unstack().reset_index()\n",
    "spent_trx_category=spent_trx_category.fillna(0)\n",
    "spent_trx_category.columns = ['cl_id', 'BACK_TRX', 'C2C_IN', 'C2C_OUT', 'CASH_ADV', 'CAT', 'DEPOSIT',\n",
    "       'POS', 'WD_ATM_OTHER', 'WD_ATM_PARTNER', 'WD_ATM_ROS'] \n",
    "spent_trx_category.index= spent_trx_category.cl_id\n",
    "\n",
    "days_usage[['BACK_TRX', 'C2C_IN', 'C2C_OUT', 'CASH_ADV', 'CAT', 'DEPOSIT',\n",
    "       'POS', 'WD_ATM_OTHER', 'WD_ATM_PARTNER', 'WD_ATM_ROS']] = spent_trx_category[['BACK_TRX', 'C2C_IN', 'C2C_OUT',\\\n",
    "                                 'CASH_ADV', 'CAT', 'DEPOSIT','POS', 'WD_ATM_OTHER', 'WD_ATM_PARTNER', 'WD_ATM_ROS']]\n",
    "\n",
    "quntity_of_mcc = raw_df[['cl_id','MCC']].groupby(['cl_id','MCC']).apply(lambda x: x.count()).unstack().\\\n",
    "                                                                                        max(axis=1).reset_index()\n",
    "quntity_of_mcc.columns=['cl_id', 'quntity_of_mcc']\n",
    "quntity_of_mcc.index= quntity_of_mcc.cl_id\n",
    "days_usage['quntity_of_mcc']=quntity_of_mcc.quntity_of_mcc\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "multy_currency = raw_df[['cl_id', 'currency']].groupby(['cl_id', 'currency']).first().reset_index()\n",
    "multy_currency = multy_currency.groupby(['cl_id']).count()\n",
    "#multy_currency.index= multy_currency.cl_id\n",
    "\n",
    "days_usage['multy_currency']=multy_currency.currency\n",
    "\n",
    "\n",
    "\n",
    "last_14_days=raw_df[['cl_id', 'last_14_days']].groupby(['cl_id']).agg('sum')\n",
    "last_30_days=raw_df[['cl_id', 'last_30_days']].groupby(['cl_id']).agg('sum')\n",
    "\n",
    "days_usage[last_14_days.columns]=last_14_days\n",
    "days_usage[last_30_days.columns]=last_30_days\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "group_mcc = pivot_table(raw_df, values='cash_in_out', \n",
    "                    index=['cl_id'], columns=['Group'], aggfunc=lambda cash_in_out: len(cash_in_out.unique())).fillna(0)\n",
    "group_mcc2 = pivot_table(raw_df, values='cash_in_out', \n",
    "                    index=['cl_id'], columns=['Group'], aggfunc=np.sum).fillna(0)\n",
    "\n",
    "mcc = pd.merge(group_mcc, group_mcc2, 'left', on=days_usage.index)\n",
    "\n",
    "mcc.index=mcc.key_0\n",
    "days_usage[mcc.columns]= mcc\n",
    "del days_usage['key_0']\n",
    "\n",
    "trx_category = raw_df[['cl_id','trx_category', 'cash_in_out']].groupby(['cl_id','trx_category']).agg(['max', \\\n",
    "                                                                'min', 'mean', 'count', 'sum']).unstack()\n",
    "days_usage[trx_category.columns]=trx_category\n",
    "\n",
    "\n",
    "quater_of_year = raw_df[['cl_id', 'quater_of_year']].groupby('cl_id').agg('sum').reset_index()\n",
    "#quater_of_year.columns=['cl_id', 'quater_of_year']\n",
    "quater_of_year.index=quater_of_year.cl_id\n",
    "days_usage[quater_of_year.columns] = quater_of_year\n",
    "\n",
    "\n",
    "\n",
    "last_action = raw_df[['cl_id', 'last_action']].groupby('cl_id').count().reset_index()\n",
    "last_action.columns=['cl_id', 'last_action']\n",
    "last_action.index=last_action.cl_id\n",
    "days_usage['last_action'] = last_action.last_action\n",
    "\n",
    "cur = raw_df[['cl_id','cur', 'cash_in_out']].groupby(['cl_id','cur']).agg(['max', 'min', 'mean', 'count', 'sum'\\\n",
    "                                                                          ]).unstack()\n",
    "#cur.index=cur.cl_id\n",
    "days_usage[cur.columns]=cur"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "last_14_days = raw_df[['cl_id', 'last_14_days']].groupby('cl_id').agg(['max', 'min', 'mean', 'count', 'sum'\\\n",
    "                                                                      ]).reset_index()\n",
    "#last_14_days.columns=['cl_id', 'last_14_days']\n",
    "last_14_days.index=last_14_days.cl_id\n",
    "days_usage[last_14_days.columns] = last_14_days\n",
    "\n",
    "last_30_days = raw_df[['cl_id', 'last_30_days']].groupby('cl_id').agg(['max', 'min', 'mean', 'count', 'sum'\\\n",
    "                                                                      ]).reset_index()\n",
    "#last_30_days.columns=['cl_id', 'last_30_days']\n",
    "last_30_days.index=last_14_days.cl_id\n",
    "days_usage[last_30_days.columns] = last_30_days"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Another one feature could be relation between (all count of transactions)  / to (count of transactions by 14 last days and 30 last days)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "days_usage['all_to_last14'] =  days_usage['last_14_days'] / days_usage['Num_trans_total']\n",
    "days_usage['all_to_last30'] =  days_usage['last_30_days'] / days_usage['Num_trans_total']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### We've got a dataset with 139 aggregated features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "days_usage.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "days_usage.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### For sure there are a lot of NaN because of aggregations. So, if Nan, fill it by 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "days_usage=days_usage.fillna(0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "days_usage."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "days_usage.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "days_usage.to_csv('days_us.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Plotting visual info. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.heatmap(days_usage.corr())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### From the graph we see that there are a lot of correletad features. Let's find them and drop"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# чтобы убрать все кореллирующие признакие\n",
    "def drop_corr_col(df_corr):\n",
    "    upper = df_corr.where(np.triu(np.ones(df_corr.shape),\n",
    "                          k=1).astype(np.bool))\n",
    "    to_drop = [column for column in upper.columns if any(upper[column] > 0.9)]\n",
    "    return(to_drop)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "corr=days_usage.corr().abs()\n",
    "drop_col=drop_corr_col(corr)\n",
    "print('We found and drop',len(drop_col), 'correlated features with the coefficient more then 0.9')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Let's make PCA with 2 components from all of them and then add this two components into dataset , others drop\n",
    "from sklearn.decomposition import PCA\n",
    "pca = PCA(n_components=2).fit_transform(days_usage[drop_col])\n",
    "\n",
    "pca_df = pd.DataFrame(pca, columns=['pca1', 'pca2'])\n",
    "pca_df.index=days_usage.index\n",
    "\n",
    "days=days_usage.drop(drop_col, axis=1)\n",
    "days[pca_df.columns]= pca_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "days.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "days.to_csv('days.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.hist(days[days['target_flag'] == 1]['target_flag'].dropna(), color='red', alpha=0.3, bins=30);\n",
    "plt.hist(days[days['target_flag'] == 0]['target_flag'].dropna(), color='green', alpha=0.5, bins=30);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.plot(days['last_14_days'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.boxplot(x='DateFormat', data=days);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.boxplot(x='Num_trans_total', data=days);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.pairplot(days_usage[['DateFormat','days_from_end_period',\n",
    "                         'multy_currency', 'quntity_of_mcc'  ] ])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Okay, we prepared Dataset , saw on the distribution of some features. Let's choose metrics to evaluate quality of future models.\n",
    "##### We have task of binary classification, so we will see on metrics ROCAUC.  Also, we have disbalance of target classes, but difference is not too big. So Accuracy and precision are good too. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import roc_auc_score, roc_curve"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import roc_auc_score, roc_curve, accuracy_score, confusion_matrix, auc\n",
    "from sklearn.model_selection import KFold, StratifiedKFold\n",
    "\n",
    "def calc_auc(y_test2, y_pred, plot_label='', prin=True):\n",
    "    fpr, tpr, _ = roc_curve(y_test2, y_pred)\n",
    "    auc_val = auc(fpr, tpr)\n",
    "    if prin:\n",
    "        print('ROC AUC: {0:.4f}'.format(auc_val))\n",
    "    if plot_label:\n",
    "        plt.plot(fpr, tpr, label=plot_label)\n",
    "        plt.xlabel('FPR')\n",
    "        plt.ylabel('TPR')\n",
    "    return auc_val"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Also make a funcion to plot confusion matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def plot_confusion_matrix(cm, classes,\n",
    "                          normalize=False,\n",
    "                          title='Confusion matrix',\n",
    "                          cmap=plt.cm.Blues):\n",
    "\n",
    "    plt.imshow(cm, interpolation='nearest', cmap=cmap)\n",
    "    plt.title(title)\n",
    "    plt.colorbar()\n",
    "    tick_marks = np.arange(len(classes))\n",
    "    plt.xticks(tick_marks, classes, rotation=45)\n",
    "    plt.yticks(tick_marks, classes)\n",
    "\n",
    "    if normalize:\n",
    "        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]\n",
    "        print(\"Normalized confusion matrix\")\n",
    "    else:\n",
    "        print('Confusion matrix, without normalization')\n",
    "\n",
    "    print(cm)\n",
    "\n",
    "    thresh = cm.max() / 2.\n",
    "    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):\n",
    "        plt.text(j, i, cm[i, j],\n",
    "                 horizontalalignment=\"center\",\n",
    "                 color=\"white\" if cm[i, j] > thresh else \"black\")\n",
    "\n",
    "    plt.tight_layout()\n",
    "    plt.ylabel('True label')\n",
    "    plt.xlabel('Predicted label')\n",
    "\n",
    "font = {'size' : 15}\n",
    "\n",
    "plt.rc('font', **font)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### We are ready to build models. \n",
    "\n",
    "To do it we will make Train_test_split. Because we have disbalanced classes and less number of user is oldest one, we will shake them by stratified shaffle split. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = days.copy()\n",
    "y = days.target_flag\n",
    "X = X.reset_index()\n",
    "X.drop([('cl_id', '')], axis=1)\n",
    "del X['target_flag']\n",
    "del X[('cl_id', '')]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import StratifiedShuffleSplit\n",
    "\n",
    "splitter = StratifiedShuffleSplit(n_splits=2, test_size=0.3, random_state=17)\n",
    "\n",
    "for train_index, test_index in splitter.split(X, y):\n",
    "    X_train = X.iloc[train_index]\n",
    "    X_test = X.iloc[test_index]\n",
    "    \n",
    "    y_train = y.iloc[train_index]\n",
    "    y_test = y.iloc[test_index]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import xgboost\n",
    "from sklearn.metrics import roc_auc_score, roc_curve\n",
    "xgb = xgboost.XGBClassifier(learning_rate=0.1, max_depth=5, n_jobs=-1)\n",
    "xgb.fit(X_train, y_train)\n",
    "y_train_predict = xgb.predict_proba(X_train)[:, 1]\n",
    "y_test_predict = xgb.predict_proba(X_test)[:, 1]\n",
    "roc_auc_train = np.round(roc_auc_score(y_train, y_train_predict), 2)\n",
    "roc_auc_test = np.round(roc_auc_score(y_test, y_test_predict), 2)\n",
    "print(\"Train: \", roc_auc_train)\n",
    "print(\"Test: \", roc_auc_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from xgboost import plot_importance\n",
    "plot_importance(xgb, max_num_features = 15)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## We see that cl_id is data leak. Drop it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "del X_train['cl_id']\n",
    "del X_test['cl_id']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import xgboost\n",
    "from sklearn.metrics import roc_auc_score, roc_curve\n",
    "xgb = xgboost.XGBClassifier( n_jobs=-1)\n",
    "xgb.fit(X_train, y_train)\n",
    "y_train_predict = xgb.predict_proba(X_train)[:, 1]\n",
    "y_test_predict = xgb.predict_proba(X_test)[:, 1]\n",
    "roc_auc_train = np.round(roc_auc_score(y_train, y_train_predict), 2)\n",
    "roc_auc_test = np.round(roc_auc_score(y_test, y_test_predict), 2)\n",
    "print(\"Train: \", roc_auc_train)\n",
    "print(\"Test: \", roc_auc_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from xgboost import plot_importance\n",
    "plot_importance(xgb, max_num_features = 15)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Bad results. Lets buid simple Logistic regression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.decomposition import PCA\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.linear_model import LinearRegression, LogisticRegression\n",
    "\n",
    "scaler= StandardScaler()\n",
    "X_scaled_train = scaler.fit_transform(X_train)\n",
    "X_scaled_test = scaler.transform(X_test)\n",
    "\n",
    "lr=LogisticRegression()\n",
    "lr.fit(X_scaled_train,y_train)\n",
    "y_pred_train= lr.predict_proba(X_scaled_train)\n",
    "y_pred_test= lr.predict_proba(X_scaled_test)\n",
    "\n",
    "#.fit(x_train1_l1,y_train).score(x_train1_l1,y_train)\n",
    "#print(score)\n",
    "print(np.round(roc_auc_score(y_train, y_pred_train[:,1]), 2))\n",
    "print(np.round(roc_auc_score(y_test, y_pred_test[:,1]), 2))\n",
    "\n",
    "y_pred_rf_test1 = lr.predict_proba(X_scaled_test)[:, 1]\n",
    "y_pred_rf_train1 = lr.predict_proba(X_scaled_train)[:, 1]\n",
    "\n",
    "print('Train:')\n",
    "calc_auc(y_train, y_pred_rf_train1, 'train')\n",
    "print('Test:')\n",
    "calc_auc(y_test, y_pred_rf_test1, 'test')\n",
    "plt.legend();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## A little bit better then random"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from catboost import CatBoost, CatBoostClassifier\n",
    "\n",
    "model = CatBoostClassifier( )\n",
    "\n",
    "model.fit(\n",
    "    X_train, y_train,\n",
    "    #cat_features=categorical_features_indices,\n",
    "    eval_set=(X_test, y_test),\n",
    "    logging_level='Silent',\n",
    "    plot=True\n",
    ");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Over fitting and the best result so far"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_pred_rf_test1 = model.predict_proba(X_test)[:, 1]\n",
    "y_pred_rf_train1 = model.predict_proba(X_train)[:, 1]\n",
    "\n",
    "print('Train:')\n",
    "calc_auc(y_train, y_pred_rf_train1, 'train')\n",
    "print('Test:')\n",
    "calc_auc(y_test, y_pred_rf_test1, 'test')\n",
    "plt.legend();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lets try GridserchCV with XGGClassifier. \n",
    "With Cross Validation \n",
    "numbor of fold = 5 \n",
    "scoring = roc auc\n",
    "shuffle true\n",
    "and different paramenters "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.cross_validation import *\n",
    "from sklearn.grid_search import GridSearchCV\n",
    "import xgboost as xgb\n",
    "\n",
    "parameters = {'nthread':[4], #when use hyperthread, xgboost may become slower\n",
    "              'objective':['binary:logistic'],\n",
    "              'learning_rate': [0.01,0.1,1 ], #so called `eta` value\n",
    "              'max_depth': [4,10],\n",
    "              'silent': [1],\n",
    "              'subsample': [0.8],\n",
    "              'colsample_bytree': [0.7],\n",
    "              'n_estimators': [500, 1000], \n",
    "              'missing':[-999],\n",
    "              'seed': [1337]}\n",
    "\n",
    "xgb_model = xgb.XGBClassifier()\n",
    "clf = GridSearchCV(xgb_model, parameters, n_jobs=5, \n",
    "                   cv=StratifiedKFold(y_train, n_folds=5, shuffle=True), \n",
    "                   scoring='roc_auc',\n",
    "                   verbose=2, refit=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "del X_train['target_flag']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clf.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_pred_clf_test1 = clf.predict_proba(X_test)[:, 1]\n",
    "y_pred_clf_train1 = clf.predict_proba(X_train)[:, 1]\n",
    "\n",
    "print('Train:')\n",
    "calc_auc(y_train, y_pred_clf_train1, 'train')\n",
    "print('Test:')\n",
    "calc_auc(y_test, y_pred_clf_test1, 'test')\n",
    "plt.legend();"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import precision_recall_curve, classification_report\n",
    "report = classification_report(y_test, clf.predict(X_test))\n",
    "print(report)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### So, about the half of samples we predict incorrect."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import itertools\n",
    "cnf_matrix = confusion_matrix(y_test, clf.predict(X_test))\n",
    "plt.figure(figsize=(10, 6))\n",
    "plot_confusion_matrix(cnf_matrix, classes=['Not_continue', 'Continue to use'],\n",
    "                      title='Confusion matrix')\n",
    "plt.savefig(\"conf_matrix.png\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "\n",
    "Best model was Catboost\n",
    "\n",
    "We made lots of feature engeneering and data preprocessing, but we can't predict thouse client who will not use our card.\n",
    "For further we have to create more features: \n",
    "1. To combine numbers into groups\n",
    "2. Cout WOE (Weight of Evidence on this groups\n",
    "3. Make Time series of transactions for each client\n",
    "4. Also 5000 of clients is not representive sample"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
